question naturalness data set
Implemented earlier for questions that do not take keywords and questions that do take one.
Questions that take two keywords are not done yet.
-----
Data 1: 600 questions taking 0 to 1 keywords
Data 2: (unused) 6,000 cases under the same conditions as Data 1
Data 3: Active learning: train the model on Data 1, then keep only the candidates whose predicted score is between 0.1 and 0.9
Data 4: Question text that takes two keywords, with one of them filled in with an X
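A minimal sketch of the Data 3 selection step, assuming the 0.1 to 0.9 range refers to the model's predicted naturalness score. `select_uncertain`, `predict`, and the toy scores are hypothetical names and values for illustration, and treating the bounds as inclusive is also an assumption.

```python
def select_uncertain(candidates, predict, low=0.1, high=0.9):
    """Keep the candidates whose predicted score lies in [low, high].

    `predict` is a hypothetical callable returning a naturalness score
    in [0, 1]; here it stands in for the model trained on Data 1.
    """
    return [c for c in candidates if low <= predict(c) <= high]


# Toy stand-in for model output (made-up scores, not real data).
toy_scores = {"q1": 0.02, "q2": 0.5, "q3": 0.95, "q4": 0.1}
selected = select_uncertain(list(toy_scores), toy_scores.get)
```

Candidates the model is already confident about (near 0 or 1) are dropped, so labeling effort goes to the uncertain ones.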
question naturalness data set
Human input
Questions about it
Unnatural (0)/Natural (1)/Blank (0.5)
Distressed(1)/Not distressed(0)
The training data was gathered in a haphazard way, so the training part was created.
Features
Keyword body
Context
Position of occurrence
I forgot to include it in the original data template.
Keyword, Question ID
For now, just output these and paste them in as additional columns.
feature generation
Keyword Body Features
Search the input for the keyword and find its first occurrence
Use that as the position
For questions that take no keyword, use the entire input text, with its beginning and end as the position.
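The feature generation above can be sketched roughly as follows. This is only an illustration: `keyword_features`, the dictionary keys, and the `window` parameter (how many characters of context to keep on each side) are assumptions, not the names used in the actual code.

```python
def keyword_features(text, keyword=None, window=10):
    """Return body, context, and position features for one question.

    With a keyword: locate its first occurrence in the input text.
    Without a keyword: use the whole text, from its start to its end.
    Returns None when the keyword does not occur in the text.
    """
    if keyword is None:
        start, end = 0, len(text)
    else:
        start = text.find(keyword)  # index of the first occurrence, or -1
        if start == -1:
            return None
        end = start + len(keyword)
    return {
        "body": text[start:end],
        # Hypothetical context: a fixed character window around the span.
        "context": text[max(0, start - window):end + window],
        "start": start,
        "end": end,
    }


example_text = "I bought a new keyboard yesterday"
with_keyword = keyword_features(example_text, "keyboard")
without_keyword = keyword_features(example_text)
```

Returning `None` for a missing keyword makes that case explicit instead of silently producing a position of -1.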
Right now I generate every question for a random selection of input texts.
That's why there are so many of them.
I made 600 cases, but there are only 13 original texts.
Right now, the input text is fixed, and the keywords in it are selected to make question candidates.
But with this, what happens if the keyword doesn't exist in the input, or if a keyword is there but it's a poor one (like "w")?
Should there be an option to say "(go through this input and ask a question about the previous input)"?
Active learning next?
If the ones that are deemed not good enough are no longer used, we won't be able to collect training data.
Exploration-exploitation trade-off
Let's just cut corners and go with epsilon-greedy.
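A minimal epsilon-greedy sketch, assuming each question candidate can be scored by the model. `epsilon_greedy`, `score`, and the default `epsilon=0.1` are hypothetical choices for illustration.

```python
import random


def epsilon_greedy(candidates, score, epsilon=0.1, rng=random):
    """Pick one candidate.

    With probability epsilon, pick uniformly at random (explore), so
    low-scored questions still get asked and labeled now and then.
    Otherwise pick the highest-scoring candidate (exploit).
    """
    if rng.random() < epsilon:
        return rng.choice(candidates)
    return max(candidates, key=score)


# Toy usage with made-up scores.
toy_scores = {"q1": 0.2, "q2": 0.9, "q3": 0.5}
greedy_pick = epsilon_greedy(list(toy_scores), toy_scores.get, epsilon=0.0)
```

With `epsilon=0.0` the choice is purely greedy, and with `epsilon=1.0` it is purely random; values in between keep some training data flowing in for questions the model currently rates poorly.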
---
This page is auto-translated from /nishio/質問自然度データセット. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.